Introduction to Open Data Science - Course Project

About the project

The objective of the course is to become familiar with R programming; it was recommended to me by a colleague. I learned how to install and configure the software needed for statistical programming. I faced some difficulties syncing files with GitHub (sometimes it works and sometimes it gives error messages), but I hope the situation will get easier going forward. I hoped the course would cover practical issues in statistical computing: R programming, reading data, and writing functions. The R for Health Data Science book is simple and to the point. It takes some time to work through Exercise Set 1 alongside the book, but I liked the idea of having all the code in indexed R chunks in the file. Some points in the "summarizing the data" chapter were difficult and took time to understand.

My GitHub repository: https://github.com/elmoazen/IODS-project .

My Link to course diary: https://elmoazen.github.io/IODS-project .

# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Tue Dec 13 00:48:45 2022"
1+1
## [1] 2

The text continues here.

Chapter 2: Regression and model validation

Describe the work you have done this week and summarize your learning.

Access Libraries

# Access tidyverse package 
          library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
      # Access GGally library
          library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

2.1 Read Data Files

# Read the CSV file 
          learning2014<-read_csv("learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
          dim(learning2014) #Explore the dimensions of the data 
## [1] 166   7
          str(learning2014) #Explore the structure of the data 
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
  • The data contain observations from 166 students.
  • We have 7 variables in the datasheet:
    1. gender: categorical variable for the students' gender (110 females and 56 males)
    2. age: numerical variable for the students' age (mean = 25.51)
    3. attitude: numerical variable (mean = 3.143)
    4. deep: numerical variable for deep learning (mean = 3.680)
    5. stra: numerical variable for strategic learning (mean = 3.121)
    6. surf: numerical variable for surface learning (mean = 2.787)
    7. points: exam points measuring the students' achievement (mean = 22.72)

2.2 Second Task: Graphical Overview

       # graphical overview of the data.
          ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

       # a scatter plot of points versus attitude
           qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")+ggtitle("Relation Between the Attitude and Exam Points") + theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## `geom_smooth()` using formula = 'y ~ x'

        # a scatter plot of points versus deep learning
            qplot(deep, points, data = learning2014) + geom_smooth(method = "lm")+ggtitle("Relation Between the Deep Learning and Exam Points")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
## `geom_smooth()` using formula = 'y ~ x'

       # a scatter plot of points versus Strategic learning
            qplot(stra, points, data = learning2014) + geom_smooth(method = "lm")+ggtitle("Relation Between the Strategic Learning and Exam Points")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
## `geom_smooth()` using formula = 'y ~ x'

       # a scatter plot of points versus surface learning
            qplot(surf, points, data = learning2014) + geom_smooth(method = "lm")+ggtitle("Relation Between the Surface Learning and Exam Points")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
## `geom_smooth()` using formula = 'y ~ x'

Summary of the variables

       # summaries of the variables in the data
           summary(learning2014)  
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
           table(learning2014$gender)
## 
##   F   M 
## 110  56

2.3 Choose three variables as explanatory variables and fit a regression model

            reg_Model <- lm(points ~ attitude + deep+ stra , data = learning2014)
        
            # print out a summary of the model
               summary(reg_Model)
## 
## Call:
## lm(formula = points ~ attitude + deep + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08
               coefficients(reg_Model)
## (Intercept)    attitude        deep        stra 
##  11.3914511   3.5253519  -0.7491996   0.9620827
 # the best fitted model 
               fit_Model <- lm(points ~ attitude + stra , data = learning2014)

 # the significant fitted model 
               fit_Model2 <- lm(points ~ attitude , data = learning2014)

The only explanatory variable with a significant p-value is attitude.

2.4 Fitted Model

# Fourth Task: Summary of the Fitted Model
               
                 # print out a summary of the Fitted model
               summary(fit_Model)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
               coefficients(fit_Model)
## (Intercept)    attitude        stra 
##   8.9728998   3.4658096   0.9136517

2.5 Diagnostic plots

# Fifth Task: Diagnostic plots
              
# draw diagnostic plots Residuals vs Fitted values 1, Normal QQ-plot 2 and Residuals vs Leverage 5

par(mfrow = c(2,2))
plot(fit_Model, which=c(1,2,5))

Assumptions:

  1. Residuals vs Fitted: this plot checks whether the residuals exhibit non-linear patterns. If the red line across the center of the plot is roughly horizontal, we can assume the residuals follow a linear pattern. In our model the line is almost horizontal, so a linear regression model is appropriate for this dataset.

  2. Normal Q-Q: this plot checks whether the residuals of the regression model are normally distributed. If the points fall roughly along the straight diagonal line, we can assume the residuals are normally distributed. In our model most of the points lie close to the diagonal.

  3. Residuals vs Leverage: this plot identifies influential observations. Any point falling outside of Cook's distance (the dashed lines) is an influential observation.
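As a complement to the Residuals vs Leverage panel, Cook's distance can also be extracted numerically. A minimal sketch using R's built-in mtcars data rather than the course data:

```r
# fit a simple linear model on the built-in mtcars data
model <- lm(mpg ~ wt, data = mtcars)

# Cook's distance for every observation
cd <- cooks.distance(model)

# a common rule of thumb flags points with distance above 4/n
influential <- which(cd > 4 / nrow(mtcars))
names(influential)
```

This gives the same information as the fifth diagnostic plot, but in a form that can be filtered or tabulated.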


Chapter 3: Logistic Regression

3.1 Create a new R Markdown file

# File 'chapter3.Rmd' is created and linked to index 
# access the tidyverse libraries tidyr, dplyr, ggplot2 
library(readr); library(dplyr); library(ggplot2); library(tidyr); library(gtsummary)

3.2 Read Data Files

# Read the CSV file 
alc<- read_csv("data/alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

About the data

These data were obtained from two secondary education schools in Portugal. They were collected using school reports and questionnaires.

The data include the following:
  1. School name
  2. Demographic data (sex, age, address)
  3. Socio-economic status (family size, parents' status, parents' education and occupation, and guardian)
  4. Social data (extra-curricular activities, romantic relationships, family relationships, free time, and going out with friends)
  5. School-related features (travel time to school, study time per week, extra support from school and family, extra paid classes, and absences)
  6. Health status
  7. Alcohol consumption: consumption on weekdays and weekends, averaged per week and categorized as high consumption when the average exceeds 2
  8. Performance: past class failures and the students' performance in Mathematics and Portuguese over three periods; G1 and G2 correspond to the 1st and 2nd period grades and G3 is the final-year grade

3.3 Relation between Alcohol consumption and 4 Variables

# Create a logistic regression model
reg_model <- glm(high_use ~  sex +failures + absences + G3, data = alc, family = "binomial")

# print summary of the model
summary(reg_model)
## 
## Call:
## glm(formula = high_use ~ sex + failures + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1561  -0.8429  -0.5872   1.0033   2.1393  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.38733    0.51617  -2.688  0.00719 ** 
## sexM         1.00870    0.24798   4.068 4.75e-05 ***
## failures     0.50382    0.22018   2.288  0.02213 *  
## absences     0.09058    0.02322   3.901 9.56e-05 ***
## G3          -0.04671    0.03948  -1.183  0.23671    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 405.59  on 365  degrees of freedom
## AIC: 415.59
## 
## Number of Fisher Scoring iterations: 4
# Odds Ratio(OR) computation from coefficients
OR<-coef(reg_model) %>% exp

# compute confidence intervals (CI)
CI<- confint(reg_model)%>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2497422 0.08880747 0.6760545
## sexM        2.7420380 1.69666006 4.4943946
## failures    1.6550376 1.08094265 2.5808643
## absences    1.0948038 1.04844539 1.1486025
## G3          0.9543647 0.88291082 1.0311936
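The odds ratios above are simply the exponentiated coefficients from the model summary (which are on the log-odds scale). A quick base-R sanity check, using the sexM estimate printed earlier:

```r
# the sexM coefficient from the model summary (log-odds scale)
b_sexM <- 1.00870

# exponentiating the log-odds gives the odds ratio reported above
or_sexM <- exp(b_sexM)
round(or_sexM, 2)  # 2.74, matching the OR table
```

The same relation holds for every row of the OR column.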

The variables selected:
  1. Gender: hypothesis that the student's sex predicts high alcohol consumption (males will consume more alcohol than females).
  2. Health: hypothesis that the students' health predicts high alcohol consumption (the high-use group will have a lower health score than the low-use group).
  3. Absences: hypothesis that the students' absences predict high alcohol consumption (the high-use group will have more absences than the low-use group).
  4. G3: hypothesis that the students' grades predict high alcohol consumption (the high-use group will have lower scores than the low-use group).

3.4 Numerical and Graphically explore variables

3.4.1. Gender relationships with alcohol consumption

#Bars of high_use and Sex
ggplot(alc, aes(x = high_use,  fill = sex))+ geom_bar() + xlab("Gender")+  ggtitle("Relation between Alcohol Consumption and Gender")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))

#Summarize high alcohol consumption by gender
 alc %>% select(high_use,sex) %>% tbl_summary(by=c("high_use"),percent = "row")
Characteristic    FALSE, N = 259    TRUE, N = 111
sex
    F             154 (79%)         41 (21%)
    M             105 (60%)         70 (40%)
(values are n (%); percentages are row percentages)
  • By gender: more males (N = 70) than females (N = 41) are high alcohol consumers.
  • The results reveal that 40% of males are high alcohol consumers while only 21% of females are.
  • I accept my hypothesis.
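The gender difference can also be tested formally. A sketch of a chi-squared test of independence, using the counts from the summary table above so it runs without the data file:

```r
# cross-tabulated counts of high_use by sex, taken from the summary table
counts <- matrix(c(154, 41,
                   105, 70),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("F", "M"),
                                 high_use = c("FALSE", "TRUE")))

# chi-squared test of independence between sex and high alcohol use
chisq.test(counts)
```

The resulting p-value is well below 0.001, consistent with the strong sexM effect in the logistic regression.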

3.4.2. Health relationships with alcohol consumption

#  Boxplot of high_use and Health
 ggplot(alc, aes(x = high_use, y = health, col = high_use))+ geom_boxplot() +  ggtitle("Relation between Alcohol Consumption and health")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"));

ggplot(alc, aes(x = high_use, y = health, col = sex))+ geom_boxplot() +  ggtitle("Relation between Alcohol Consumption and health in both genders")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))

#Summarize relation between high alcohol consumption and health in  both genders 
alc %>% group_by(sex, high_use) %>% summarise(mean(health),sd(health));
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use `mean(health)` `sd(health)`
##   <chr> <lgl>             <dbl>        <dbl>
## 1 F     FALSE              3.37         1.40
## 2 F     TRUE               3.39         1.55
## 3 M     FALSE              3.67         1.41
## 4 M     TRUE               3.93         1.28
alc %>% group_by( high_use) %>% summarise(mean(health),sd(health))
## # A tibble: 2 × 3
##   high_use `mean(health)` `sd(health)`
##   <lgl>             <dbl>        <dbl>
## 1 FALSE              3.49         1.41
## 2 TRUE               3.73         1.40
  • The overall mean health score for the high-use group (3.73) is higher than for the low-use group (3.49).

  • The mean health score among high-use students is higher in males than in females (females = 3.39, males = 3.93).

  • In males, the difference between low- and high-consumption students is larger (low = 3.67, high = 3.93).

  • In females, the difference between low- and high-consumption students is small (low = 3.37, high = 3.39).

  • The result differs from my hypothesis.

3.4.3. Absences relationships with alcohol consumption

 #  Boxplot of high_use and Absences
 ggplot(alc, aes(x = high_use, y = absences, col = high_use))+ geom_boxplot() +  ggtitle("Relation between Alcohol Consumption and Abscences")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"));

 ggplot(alc, aes(x = high_use, y = absences, col = sex))+ geom_boxplot() +  ggtitle("Relation between Alcohol Consumption and Abscences")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))

#Summarize relation between high alcohol consumption and absences in  both genders 
alc %>% group_by(sex, high_use) %>% summarise(mean(absences),sd(absences));
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use `mean(absences)` `sd(absences)`
##   <chr> <lgl>               <dbl>          <dbl>
## 1 F     FALSE                4.25           5.29
## 2 F     TRUE                 6.85           9.40
## 3 M     FALSE                2.91           2.67
## 4 M     TRUE                 6.1            5.29
alc %>% group_by( high_use) %>% summarise(mean(absences),sd(absences))
## # A tibble: 2 × 3
##   high_use `mean(absences)` `sd(absences)`
##   <lgl>               <dbl>          <dbl>
## 1 FALSE                3.71           4.46
## 2 TRUE                 6.38           7.06
  • The overall mean absences in the high-use group are higher than in the low-use group (low = 3.7, high = 6.4).
  • The mean school absences among high-use students are 6.85 for females and 6.1 for males.
  • I accept my hypothesis.

3.4.4. Grades relationships with alcohol consumption

#  Boxplot of high_use and G3
 ggplot(alc, aes(x = high_use, y = G3, col = high_use))+ geom_boxplot() + ylab("Grades")+  ggtitle("Relation between Alcohol Consumption and Grades")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"));

ggplot(alc, aes(x = high_use, y = G3, col = sex))+ geom_boxplot() + ylab("Grades")+  ggtitle("Relation between Alcohol Consumption and Grades in both Genders")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))

#Summarize relation between high alcohol consumption and Grades in  both genders 
alc %>% group_by(sex, high_use) %>% summarise(mean(G3),sd(G3));alc %>% group_by( high_use) %>% summarise(mean(G3),sd(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use `mean(G3)` `sd(G3)`
##   <chr> <lgl>         <dbl>    <dbl>
## 1 F     FALSE          11.4     3.24
## 2 F     TRUE           11.8     2.81
## 3 M     FALSE          12.3     3.45
## 4 M     TRUE           10.3     3.08
## # A tibble: 2 × 3
##   high_use `mean(G3)` `sd(G3)`
##   <lgl>         <dbl>    <dbl>
## 1 FALSE          11.8     3.35
## 2 TRUE           10.9     3.05
  • The mean grade for low alcohol consumption is 11.8 while for high alcohol consumption it is 10.9.
  • Males show a large difference between the consumption groups (low = 12.3, high = 10.3), while for females the difference is only 0.4 and in the opposite direction (low = 11.4, high = 11.8).
  • I accept my hypothesis.

3.5 Create Logistic regression

# Create a logistic regression model
reg_model <- glm(high_use ~  sex +health + absences + G3, data = alc, family = "binomial")

# print summary of the model
summary(reg_model)
## 
## Call:
## glm(formula = high_use ~ sex + health + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2867  -0.8513  -0.6097   1.0669   2.1530  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.20679    0.58419  -2.066   0.0389 *  
## sexM         1.02053    0.24766   4.121 3.78e-05 ***
## health       0.06783    0.08880   0.764   0.4450    
## absences     0.09262    0.02333   3.970 7.19e-05 ***
## G3          -0.07619    0.03680  -2.070   0.0384 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 410.38  on 365  degrees of freedom
## AIC: 420.38
## 
## Number of Fisher Scoring iterations: 4
# Odds Ratio(OR) computation from coefficients
OR<-coef(reg_model) %>% exp

# compute confidence intervals (CI)
CI<- confint(reg_model)%>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2991572 0.09342207 0.9282541
## sexM        2.7746780 1.71811644 4.5452414
## health      1.0701795 0.90040767 1.2764711
## absences    1.0970499 1.05023715 1.1509892
## G3          0.9266362 0.86143669 0.9956498
  • The regression model shows statistically significant coefficients for sex, absences and grades.
  • According to the model, the following are associated with a higher probability of high alcohol consumption:
  1. Male sex (OR = 2.77)
  2. Better health score (OR = 1.07), but this is not significant
  3. More absences (OR = 1.10)
  4. Lower grades (OR = 0.93)
  • The results match the previously stated hypotheses, except for health, which is not a significant predictor of alcohol consumption among students.

3.6 Prediction

# Explore the predictive power of the model whose variables had a statistical relationship with high/low alcohol consumption

# Predict probability of high_use and add it to alc
probabilities <- predict(reg_model, type = "response")
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   247   12
##    TRUE     83   28
#plot of 'high_use' versus 'probability' in 'alc'
ggplot(alc, aes(x = probability , y = high_use, col=prediction)) +geom_point()

# Compute the total proportion of inaccurately classified individuals (= the training error) and compare the model's performance with a simple guessing strategy
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2567568
  • The proportion of wrong predictions in the training data (the training error) is 25.7%. A simple guessing strategy that always predicts FALSE would be wrong for all 111 high-use students, i.e. 111/370 = 30% of the time, so the model performs better than guessing.
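To make the loss function concrete, here is a tiny worked example with made-up class labels and probabilities (not the course data): a prediction counts as wrong when the probability lands on the wrong side of 0.5.

```r
# same loss function as above: mean proportion of wrong predictions
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# toy example: observations 3 and 4 are misclassified, so the error is 0.5
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.4, 0.6))  # 0.5
```

With logical `class` values, R coerces TRUE/FALSE to 1/0, which is why the same function works directly on `alc$high_use`.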

Chapter 4: Clustering and Classification

4.1 Create new R markdown file

# Create a new R Markdown file and save it as an empty file named ‘chapter4.Rmd’

#Load Mass and corrplot libraries
library (MASS); library (corrplot)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:gtsummary':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
## corrplot 0.92 loaded
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(ggplot2)

4.2 Load Data

# This is a so-called "R chunk" where you can write R code.
data("Boston")
#Explore the structure and the dimensions of the data 
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

These data describe the housing values in the suburbs of Boston, with 506 observations of 14 variables.

The data contains the following:

  1. crim: per capita crime rate by town.
  2. zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. indus: proportion of non-retail business acres per town.
  4. chas: Charles River dummy variable (= 1 if the tract bounds the river; 0 otherwise).
  5. nox: nitrogen oxides concentration.
  6. rm: average number of rooms per dwelling.
  7. age: proportion of owner-occupied units built prior to 1940.
  8. dis: weighted mean of distances to five Boston employment centres.
  9. rad: index of accessibility to radial highways.
  10. tax: full-value property-tax rate per $10,000.
  11. ptratio: pupil-teacher ratio by town.
  12. black: a transformation of the proportion of blacks by town.
  13. lstat: lower status of the population (percent).
  14. medv: median value of owner-occupied homes in $1000s.

4.3 Graphical presentation of data

# calculate the correlation matrix
cor_matrix <- cor(Boston) 
# round to two decimals for readability
cor_matrix %>% round(digits=2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
corrplot(cor_matrix, method="circle")

There is high variety in the correlations among the variables:
  1. High positive correlation, e.g. rad and tax = 0.91, indus and tax = 0.72
  2. High negative correlation, e.g. dis and nox = -0.77, dis and age = -0.75, dis and indus = -0.71
  3. Moderate positive correlation, e.g. rad and crim = 0.63, indus and rad = 0.60
  4. Moderate negative correlation, e.g. rm and lstat = -0.61, age and zn = -0.57
  5. Almost no correlation between some variables, e.g. zn and chas = -0.04, chas and crim = -0.06
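The strongest pairs can also be picked out programmatically rather than by eye. A sketch on the same Boston data (MASS ships with R):

```r
library(MASS)

cor_matrix <- cor(Boston)

# blank out the diagonal and lower triangle so each pair appears once
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA

# pairs with absolute correlation above 0.7
strong <- which(abs(cor_matrix) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[strong[, 1]],
           var2 = colnames(cor_matrix)[strong[, 2]],
           r    = round(cor_matrix[strong], 2))
```

Among others, this recovers the rad/tax pair (r = 0.91) highlighted above.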

4.4 Data Set Standardization

#  standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)
boston_scaled$crim <- as.numeric(boston_scaled$crim)

#summary of Crim
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
  • The variables are now standardized, each with mean = 0 and standard deviation = 1.
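Scaling simply subtracts each column's mean and divides by its standard deviation; a quick sanity check on a toy vector (not the Boston data):

```r
x <- c(1, 2, 3)

# scale() computes (x - mean(x)) / sd(x) column-wise
scaled_x <- as.numeric(scale(x))
scaled_x  # -1 0 1
```

Here mean(x) = 2 and sd(x) = 1, so the scaled values are -1, 0 and 1.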
# create a quantile vector of crim 
bins <- quantile(boston_scaled$crim)
                 
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)

# look at the table of the new factor crime
table(crime)
## crime
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##             127             126             126             127
# create labels for crime 
labels_crime <- c("low", "med_low", "med_high", "high") 

# Same step, now creating the categorical crime-rate variable with the labels `"low"`, `"med_low"`, `"med_high"`, `"high"`
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels_crime)
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
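The quantile-binning pattern used above can be sketched on a toy vector (the values are hypothetical, not from the Boston data):

```r
# toy numeric vector standing in for boston_scaled$crim
x <- 1:10

# quartile break points: min, Q1, median, Q3, max
bins <- quantile(x)

# cut into four roughly equal-sized classes; include.lowest keeps the minimum value
x_cat <- cut(x, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(x_cat)
```

Because the breaks are quartiles, each class collects roughly a quarter of the observations, just as the 127/126/126/127 split above shows.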
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

Train and test sets

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]


# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
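The split logic above, sketched on toy data (the seed is an illustrative addition; the chapter's own split is unseeded, so its exact rows vary from run to run):

```r
set.seed(2022)  # hypothetical seed, only for a reproducible illustration
df <- data.frame(x = rnorm(100), y = rnorm(100))

n <- nrow(df)
ind <- sample(n, size = n * 0.8)  # 80 distinct row indices, drawn at random

train <- df[ind, ]   # 80% of the rows
test  <- df[-ind, ]  # the remaining 20%
```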

4.5 Linear Discriminant Analysis

# Fit the linear discriminant analysis on the train set (lda() comes from the MASS package)
library(MASS)
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2425743 0.2698020 0.2400990 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       1.09192475 -0.9912578 -0.11484506 -0.9091316  0.4689974 -0.9337820
## med_low  -0.09281428 -0.2633801 -0.07145661 -0.5430794 -0.1114517 -0.3448286
## med_high -0.37552361  0.1754278  0.19723335  0.3516234  0.1412393  0.3772003
## high     -0.48724019  1.0149946 -0.02879709  1.0504838 -0.4008802  0.8172071
##                 dis        rad        tax     ptratio       black       lstat
## low       0.9115221 -0.6706366 -0.7394641 -0.50510906  0.38358421 -0.79864362
## med_low   0.3405493 -0.5564696 -0.4805604 -0.08880976  0.30948775 -0.13341954
## med_high -0.3550392 -0.4013159 -0.2877350 -0.21888875  0.08909738 -0.06684933
## high     -0.8630134  1.6596029  1.5294129  0.80577843 -0.71565108  0.81690022
##                 medv
## low       0.59249016
## med_low   0.01130012
## med_high  0.20222172
## high     -0.65370444
## 
## Coefficients of linear discriminants:
##                 LD1           LD2         LD3
## zn       0.12253775  0.8031322109 -0.84083986
## indus    0.03557388 -0.4098911919  0.47762064
## chas    -0.09215050  0.0086198852  0.01101208
## nox      0.33657516 -0.7213826926 -1.48004910
## rm      -0.10951879 -0.1306524165 -0.13129028
## age      0.28902937 -0.2093621014 -0.17441950
## dis     -0.08865629 -0.3491559235  0.17167335
## rad      3.23880051  0.9819799036  0.12944735
## tax     -0.07112787 -0.0076741853  0.38309652
## ptratio  0.16400758  0.0008162476 -0.34086547
## black   -0.17145243  0.0076732967  0.11909976
## lstat    0.15533992 -0.2430963803  0.36092175
## medv     0.18066476 -0.3568682907 -0.21414092
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9424 0.0446 0.0130
# Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. 

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
  }


# target classes as numeric
classes <- as.numeric(train$crime)

#Draw the LDA (bi)plot.

# plot the lda results
plot(lda.fit, dimen = 2); lda.arrows(lda.fit, myscale = 1)

The prior probabilities of the groups in the training set are roughly equal: low = 24.8%, med_low = 24.3%, med_high = 27.0%, high = 24.0%. The proportion of trace gives the separation achieved by each linear discriminant: LD1 94.2%, LD2 4.5% and LD3 1.3%, so the first discriminant carries almost all of the between-group separation.
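As a sanity check of the same workflow, MASS::lda can be fitted to the built-in iris data (an illustrative dataset, not part of this exercise):

```r
library(MASS)  # lda()

iris_fit <- lda(Species ~ ., data = iris)

# in-sample predictions and accuracy
iris_pred <- predict(iris_fit, iris)$class
iris_acc  <- mean(iris_pred == iris$Species)
```

LDA separates the three species almost perfectly, which is why it is a standard benchmark for this method.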

4.6 Predict from LDA model

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)


# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10      14        3    0
##   med_low    3      19        6    0
##   med_high   0       3       14    0
##   high       0       0        1   29


  • The LDA model correctly predicts the crime class for 72 of the 102 test observations (accuracy ≈ 70.6%)
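The accuracy comes from the diagonal of the cross-tabulation; re-entering the table above by hand:

```r
# confusion matrix copied from the output above
conf <- matrix(c(10, 14,  3,  0,
                  3, 19,  6,  0,
                  0,  3, 14,  0,
                  0,  0,  1, 29),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

n_correct <- sum(diag(conf))        # correct predictions sit on the diagonal
accuracy  <- n_correct / sum(conf)  # 72 / 102
```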

4.7 Reload Data and Standardize Dataset

#Reload the Boston dataset and standardize the dataset 
# load the data
data("Boston")

#scale  dataset
boston_scaled <- scale(Boston, center = TRUE, scale = TRUE)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

#calculate distance between observations
dist_eu <- dist(boston_scaled, method = "euclidean")

#summary of distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
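The two metrics differ in how they accumulate coordinate differences; a pair of points where the difference is visible:

```r
# two 2-D points: (0, 0) and (3, 4)
pts <- rbind(c(0, 0), c(3, 4))

d_euclid <- dist(pts, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
d_manhat <- dist(pts, method = "manhattan")  # |3| + |4|       = 7
```

Manhattan distances are always at least as large as Euclidean ones, which is also visible in the two summaries above.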
#Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again

# k-means clustering on the standardized data (seed set for reproducibility)
set.seed(123)
km <- kmeans(boston_scaled, centers = 4)

# plot the scaled dataset with the cluster assignment as colour
pairs(boston_scaled, col = km$cluster)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering with the chosen number of clusters
km <- kmeans(boston_scaled, centers = 2)

# plot the scaled dataset with the two clusters
pairs(boston_scaled, col = km$cluster)

The optimal number of clusters is where the total WCSS drops sharply and then levels off. Here the curve bends at two, so two clusters seem optimal.
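The elbow logic can be checked on simulated data with a known cluster count (toy data, not the Boston set):

```r
set.seed(42)
# two well-separated blobs, so the true number of clusters is 2
blobs <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
               matrix(rnorm(100, mean = 10), ncol = 2))

k_max <- 6
twcss <- sapply(1:k_max, function(k) {
  kmeans(blobs, centers = k, nstart = 10)$tot.withinss
})

# the elbow is where the relative drop in WCSS is largest
rel_drop <- -diff(twcss) / twcss[-k_max]
k_best   <- which.max(rel_drop) + 1
```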

4.8 Bonus

# k-means clustering =5
km_2 <- kmeans(boston_scaled, centers = 5)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km_2$cluster)

# linear discriminant analysis
lda.fit_2 <- lda(km_2$cluster ~ ., data = boston_scaled)

# print the LDA.fit_2 
lda.fit_2
## Call:
## lda(km_2$cluster ~ ., data = boston_scaled)
## 
## Prior probabilities of groups:
##          1          2          3          4          5 
## 0.05731225 0.13043478 0.26482213 0.24901186 0.29841897 
## 
## Group means:
##         crim          zn       indus        chas        nox         rm
## 1  3.0022987 -0.48724019  1.01499462 -0.27232907  1.0593345 -1.3064650
## 2 -0.4129151  2.21369458 -1.14073862 -0.21267603 -1.1776883  0.6268583
## 3 -0.3927690 -0.05350915 -0.72731163  0.05086573 -0.5787393  0.6963941
## 4  0.3838378 -0.48724019  1.10691918  0.10263286  1.1951864 -0.2747091
## 5 -0.3678595 -0.41994296  0.02544269  0.01447956 -0.1724217 -0.4118452
##          age        dis        rad        tax    ptratio      black      lstat
## 1  0.9805356 -1.0484716  1.6596029  1.5294129  0.8057784 -1.1906614  1.8708759
## 2 -1.4768042  1.6629608 -0.6477717 -0.5460224 -0.7465036  0.3461285 -0.9418899
## 3 -0.4958014  0.2904872 -0.5584811 -0.7610166 -0.5668389  0.3659840 -0.7643523
## 4  0.7423860 -0.8065257  1.2549050  1.3202837  0.4886767 -0.5898468  0.6409434
## 5  0.2776840 -0.1102834 -0.5871332 -0.4814240  0.2667868  0.2447916  0.1958522
##         medv
## 1 -1.3102002
## 2  0.7058131
## 3  0.7824526
## 4 -0.5359120
## 5 -0.3040503
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3           LD4
## crim    -0.427666835 -0.199511558 -1.53952452 -0.4079654643
## zn       0.145299886 -1.666731999 -0.13266159  0.8941701291
## indus   -0.344632503  0.007043196  0.31049852  0.4203611322
## chas    -0.078704824  0.058245348  0.03305941  0.0762455497
## nox     -1.122343755 -0.742824398  0.62047881 -0.1461192814
## rm       0.460578250 -0.082860834  0.46972900 -0.3594922840
## age     -0.004009447  0.638346347 -0.09833009  0.5748989064
## dis     -0.007017325 -0.592461290  0.20900264 -0.0006883399
## rad     -1.203224091  0.019174331  0.63799168 -0.9675794050
## tax     -0.827069577 -1.161729561  0.30529810  0.3353654284
## ptratio -0.120673757  0.077914970 -0.07831677  0.5866415075
## black    0.121519360  0.054015976  0.06837498 -0.0012996769
## lstat   -0.434053312 -0.108659705 -0.55646297  0.0223022252
## medv    -0.145642897 -0.282461279 -0.38632976 -0.3660128703
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4 
## 0.7362 0.1819 0.0571 0.0248
# plot the lda results
plot(lda.fit_2, dimen = 2, pch = km_2$cluster)
lda.arrows(lda.fit_2, myscale = 2)


Chapter 5: Dimensionality reduction techniques

5.1 Graphical overview and Summary

# Access the GGally package
library(GGally)

# Read the CSV file 
human<- read.csv("data/human.csv",row.names=1)

#summary of variables
summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
#Graphical overview
ggpairs(human, progress = FALSE )

The distributions of the variables and the relationships between them:

  • All variables are skewed, positively or negatively, except expected years of education (Edu.Exp)

  • The correlations between the variables vary:

  1. Strong positive correlations: Life.Exp & Edu.Exp, Life.Exp & GNI, Edu.Exp & GNI, and Ado.Birth & Mat.Mor

  2. Strong negative correlations: Life.Exp & Mat.Mor, Life.Exp & Ado.Birth, and Mat.Mor & Edu2.FM

  3. Moderate positive correlations: Edu.Exp & Edu2.FM, Life.Exp & Edu2.FM, and GNI & Edu2.FM

  4. Moderate negative correlations: Ado.Birth & Edu2.FM and GNI & Mat.Mor

  5. Parli.F is the least correlated with all the other variables.
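These strengths can be read off a correlation matrix; a simulated sketch that mirrors the pattern (variable names hypothetical, not the human data):

```r
set.seed(3)
n <- 200
life <- rnorm(n)                     # stands in for Life.Exp
edu  <- life + rnorm(n, sd = 0.5)    # strongly positively related
mat  <- -life + rnorm(n, sd = 0.5)   # strongly negatively related
parl <- rnorm(n)                     # unrelated, like Parli.F

cm <- cor(data.frame(life, edu, mat, parl))
round(cm, 2)
```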

5.2 Principal Component Analysis (PCA)

# Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

# perform principal component analysis (with the SVD method)
pca_non<-prcomp(human)

#summary of PCA
summary(pca_non)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_non, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = NA, ylab = NA)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

5.3 Standardize the variables

# Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables

summary (human_std)
##     Edu2.FM           Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
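scale() subtracts each column mean and divides by the column standard deviation, so every standardized column ends up with mean 0 and SD 1; a quick check on a toy matrix:

```r
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
m_std <- scale(m)

col_means <- colMeans(m_std)       # all (numerically) zero
col_sds   <- apply(m_std, 2, sd)   # all exactly one
```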
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

#summary of PCA
summary(pca_human)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
summary(pca_non)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = NA, ylab = NA)

When the variables are measured in different units, it is recommended to standardize the data before PCA: otherwise the variables with the largest variances (here GNI) dominate the first principal component, and a mere change of units would change the results.
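The effect is easy to reproduce on simulated data: give one variable a much larger scale and it swallows PC1 of the unscaled analysis (toy data, not the human set):

```r
set.seed(1)
toy <- data.frame(small = rnorm(100),          # unit scale
                  big   = 1000 * rnorm(100))   # comparable information, huge scale

# unscaled PCA: the large-variance column dominates PC1 almost completely
p_raw   <- prcomp(toy)
var_raw <- p_raw$sdev^2 / sum(p_raw$sdev^2)

# scaled PCA: the variance splits roughly evenly between the components
p_std   <- prcomp(toy, scale. = TRUE)
var_std <- p_std$sdev^2 / sum(p_std$sdev^2)
```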

5.4 Personal Interpretations

# Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)
  • Standardizing the data produces a much clearer visualization of the variation captured by the PCA

  • The proportions of variance and the cumulative proportion can now be interpreted easily: PC1 explains 53.6% and PC2 16.2% of the total variance

  • The arrows are now visible and show the relationships between the variables

5.5 Multiple Correspondence Analysis (MCA)

# Load library

library(dplyr)
library(tidyr)

# load data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)


# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset (all_of() avoids the
# deprecated external-vector selection)
tea_time <- dplyr::select(tea, all_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free")+geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

# Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)

# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time[, 1:5], graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time[, 1:5], graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.335   0.309   0.257   0.224   0.198   0.184   0.172
## % of var.             16.762  15.441  12.866  11.181   9.924   9.188   8.625
## Cumulative % of var.  16.762  32.204  45.069  56.250  66.174  75.362  83.987
##                        Dim.8   Dim.9  Dim.10
## Variance               0.141   0.105   0.075
## % of var.              7.031   5.249   3.734
## Cumulative % of var.  91.017  96.266 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.325  0.105  0.088 | -0.288  0.090  0.069 |  0.305
## 2                  | -0.259  0.067  0.037 | -0.074  0.006  0.003 |  0.789
## 3                  | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217
## 4                  | -0.580  0.334  0.481 | -0.328  0.116  0.154 | -0.290
## 5                  | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217
## 6                  | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217
## 7                  | -0.403  0.162  0.242 | -0.283  0.086  0.119 |  0.217
## 8                  | -0.259  0.067  0.037 | -0.074  0.006  0.003 |  0.789
## 9                  |  0.155  0.024  0.012 |  0.974  1.025  0.461 | -0.005
## 10                 |  0.521  0.270  0.142 |  0.845  0.770  0.373 |  0.612
##                       ctr   cos2  
## 1                   0.120  0.078 |
## 2                   0.806  0.343 |
## 3                   0.061  0.070 |
## 4                   0.109  0.120 |
## 5                   0.061  0.070 |
## 6                   0.061  0.070 |
## 7                   0.061  0.070 |
## 8                   0.806  0.343 |
## 9                   0.000  0.000 |
## 10                  0.485  0.196 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.294   0.073   4.681 |   0.198   0.627   0.013
## Earl Grey          |  -0.265   2.689   0.126  -6.147 |   0.088   0.319   0.014
## green              |   0.487   1.558   0.029   2.962 |  -0.956   6.515   0.113
## alone              |  -0.017   0.011   0.001  -0.405 |  -0.251   2.643   0.117
## lemon              |   0.668   2.926   0.055   4.059 |   0.458   1.497   0.026
## milk               |  -0.338   1.428   0.030  -3.010 |   0.220   0.659   0.013
## other              |   0.287   0.148   0.003   0.873 |   2.207   9.465   0.151
## tea bag            |  -0.607  12.466   0.482 -12.008 |  -0.336   4.134   0.147
## tea bag+unpackaged |   0.348   2.261   0.055   4.063 |   1.000  20.281   0.456
## unpackaged         |   1.959  27.484   0.524  12.511 |  -1.025   8.173   0.143
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                1.961 |   1.045  20.945   0.358  10.342 |
## Earl Grey            2.033 |  -0.462  10.689   0.386 -10.737 |
## green               -5.813 |   0.360   1.110   0.016   2.190 |
## alone               -5.905 |   0.150   1.136   0.042   3.534 |
## lemon                2.787 |  -1.561  20.842   0.301  -9.491 |
## milk                 1.963 |   0.093   0.141   0.002   0.828 |
## other                6.712 |   1.826   7.773   0.103   5.552 |
## tea bag             -6.637 |   0.069   0.211   0.006   1.367 |
## tea bag+unpackaged  11.677 |  -0.050   0.061   0.001  -0.586 |
## unpackaged          -6.548 |  -0.196   0.357   0.005  -1.249 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.115 0.421 |
## How                | 0.076 0.220 0.385 |
## how                | 0.708 0.503 0.008 |
## sugar              | 0.065 0.004 0.412 |
## where              | 0.701 0.702 0.061 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

The plot above helps to identify the variable categories most strongly associated with each dimension:

  • Green tea tends to be bought unpackaged from a tea shop

  • Earl Grey tends to be bought as tea bags from a chain store

  • Earl Grey is drunk with milk and sugar

  • Black tea is drunk with lemon and without sugar

  • The mixed categories (chain store + tea shop, tea bag + unpackaged) fall between the others and are less clearly separated


Chapter 6: Analysis of Longitudinal data

6.1 Meet and Repeat: PART I but using the RATS data

  • the analyses of Chapter 8 of MABS, using the R codes of Exercise Set 6: Meet and Repeat: PART I but using the RATS data (from Chapter 9 and Meet and Repeat: PART II). (0-7 points: 0-4 points for graphs or analysis results + 0-3 points for their interpretations)
# Load packages 
library(readr)
library(dplyr)
library(ggplot2)
RATSL <- read.csv("data/RATSL.csv")
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
# Convert categorical variables in RATS to factors.
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
# Glimpse the data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …

Graphical Presentation

# Draw RATSL plot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:8, times=2)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

Standardizing Data

# standardise RATSL variable 
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdWeight = (Weight - mean(Weight))/sd(Weight)) %>%
  ungroup()
# Glimpse the data
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ WD        <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1…
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, …
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, …
# Plot standardized RATSL
 ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:8, times=2)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized RATSL")

Interpretation: Standardization removes the overall level differences, making the group profiles comparable. Groups 2 and 3 gain more weight during the follow-up than group 1, and group 2 contains a clear outlier.
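The same per-time standardization can also be sketched in base R with ave() (toy data; the chapter's dplyr version is equivalent):

```r
d <- data.frame(Time   = rep(c(1, 8), each = 3),
                Weight = c(10, 12, 14, 20, 25, 30))

# within each time point: subtract the per-time mean, divide by the per-time SD
d$stdWeight <- ave(d$Weight, d$Time,
                   FUN = function(w) (w - mean(w)) / sd(w))
```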

Summary of RATSL with the mean and standard error of Weight

# group sizes differ (8 rats in group 1, 4 in groups 2 and 3), so use the
# per-group count n() rather than a fixed n = 8
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n())) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

Interpretation: The graph shows the mean profiles with their standard errors. Group 2 has the highest variability and group 1 the lowest, and the standard error in group 3 decreases over time.
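The error bars show the standard error of the mean, sd/sqrt(n), which depends on the per-group sample size (8 rats in group 1, 4 in groups 2 and 3); a minimal sketch with hypothetical weights:

```r
w <- c(240, 250, 260, 270)  # hypothetical weights for one group at one time point

se <- sd(w) / sqrt(length(w))  # standard error of the mean
```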

# Create a summary dataset with the mean weight per rat (ignoring the baseline measurement)
 RATSLX <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# summary of the RATSLX
summary(RATSLX)
##  Group       ID          mean      
##  1:8   1      : 1   Min.   :238.9  
##  2:4   2      : 1   1st Qu.:267.4  
##  3:4   3      : 1   Median :360.1  
##        4      : 1   Mean   :386.3  
##        5      : 1   3rd Qu.:505.4  
##        6      : 1   Max.   :594.0  
##        (Other):10
# box plot for RATSLX
ggplot(RATSLX, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "(Weight)/ days ")

Remove outliers

# create a new dataset by removing the outlier
RATSLX2 <- RATSLX %>% filter(mean < 550)
# draw a boxplot without the outlier
ggplot(RATSLX2, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "(Weight)/ days ")

Analysis of variance (ANOVA)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ Group, data = RATSLX2)
 anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 207659  103830  501.81 2.721e-12 ***
## Residuals 12   2483     207                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: The ANOVA shows a highly significant difference in mean weight between the three study groups (p ≈ 2.7e-12).
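The same one-way ANOVA mechanics on simulated data with clearly different group means (toy data, not the rats):

```r
set.seed(7)
d <- data.frame(group = factor(rep(c("A", "B", "C"), each = 10)),
                y     = c(rnorm(10, mean = 0), rnorm(10, mean = 5), rnorm(10, mean = 10)))

fit_toy <- lm(y ~ group, data = d)
a_tab   <- anova(fit_toy)
p_value <- a_tab[["Pr(>F)"]][1]   # tiny, since the group means are far apart
```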

6.2 Meet and Repeat: PART II using the BPRS data

# Read BPRSL dataset
BPRSL <- read.csv("data/BPRSL.csv")
# Factor treatment & subject
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
str(BPRSL)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...


Standardize BPRS

# Standardise the variable bprs
BPRSL <- BPRSL %>%
  group_by(week) %>%
  mutate(stdbprs = (bprs-mean(bprs))/sd(bprs)) %>%
  ungroup()
# Glimpse the data
glimpse(BPRSL)
## Rows: 360
## Columns: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stdbprs   <dbl> -0.4245908, 0.7076513, 0.4245908, 0.4953559, 1.6983632, 0.00…
# Create a summary data by treatment and subject with mean as the summary variable (ignoring baseline week 0)
BPRSL8S <- BPRSL %>%
  filter(week > 0) %>%
  group_by(treatment, subject) %>%
  summarise( mean=mean(bprs) ) %>%
  ungroup()
## `summarise()` has grouped output by 'treatment'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(BPRSL8S)
## Rows: 40
## Columns: 3
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ mean      <dbl> 41.500, 43.125, 35.375, 52.625, 50.375, 34.000, 37.125, 30.5…
# Draw a boxplot of the mean versus treatment
library(ggplot2)
ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(bprs), weeks 1-8")

Create Regression model for BPRS

# create a regression model for BPRS
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# access library lme4
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# summary of the intercept model 
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000
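The variance components printed above can also be expressed as an intraclass correlation, i.e. the share of the total variance that comes from between-subject differences. This is a small by-hand check using the numbers from the summary:

```r
# Variance components from the random intercept model summary above
var_subject  <- 47.41    # between-subject (Intercept) variance
var_residual <- 104.21   # residual variance

# Intraclass correlation: share of total variance due to subjects
icc <- var_subject / (var_subject + var_residual)
round(icc, 3)  # 0.313
```

So roughly a third of the variation in bprs is attributable to differences between subjects, which is why a mixed model is preferable to the plain linear regression here.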

Create a random intercept and random slope model

# create a random intercept and random slope model
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
#summary of model 
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

ANOVA for both models

# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref2    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
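The Chisq statistic and p-value in the ANOVA table can be reproduced by hand from the deviances, which is a useful sanity check on how `anova()` compares nested models:

```r
# Likelihood-ratio statistic: difference of the two deviances in the table
chisq_stat <- 2738.7 - 2731.4   # 7.3; printed as 7.2721 at full precision

# The random slope model adds 2 parameters (the slope variance and its
# correlation with the intercept), so the reference distribution has df = 2
p_value <- pchisq(7.2721, df = 2, lower.tail = FALSE)
round(p_value, 5)  # 0.02636, matching the Pr(>Chisq) column
```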

The ANOVA likelihood-ratio test is statistically significant (χ² = 7.27, df = 2, p = 0.026), so the random intercept and random slope model fits the data better than the plain random intercept model.

Create a random intercept and random slope model with interaction

# create a random intercept and random slope model with the interaction
BPRS_ref3 <- lmer(bprs ~ week + treatment + week * treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref3)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
The interaction term week:treatment2 has a t value of only 1.785, so there is at most weak evidence that the BPRS slopes differ between the two treatment groups.

ANOVA for the models

# perform an ANOVA test on the two models
anova(BPRS_ref3, BPRS_ref2)
## Data: BPRSL
## Models:
## BPRS_ref2: bprs ~ week + treatment + (week | subject)
## BPRS_ref3: bprs ~ week + treatment + week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref2    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref3    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood-ratio test is not statistically significant at p < 0.05 (p = 0.075), so the interaction model does not fit the data significantly better than the second model.

Draw Plots

# draw the plot of BPRSL with the observed bprs values
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)), name = "Observed BPRS")

The observed BPRS values decline over time in both treatment groups, with clear variation between individual subjects.

Plot Fitted values

# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref3)
# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>% mutate(Fitted = Fitted)
# draw the plot of BPRSL with the fitted values of bprs
ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)), name = "Fitted BPRS")

The fitted values show much less individual variation in BPRS than the observed values, since the model reduces each subject's profile to a straight line.